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CJ I Abstract 

\-^ , Asynchronous methods for solving systems of linear equations have been researched since Chazan 

c/2 ' and Miranker published their pioneering paper on chaotic relaxation in 1969. The underlying idea of 

asynchronous methods is to avoid processor idle time by allowing the processors to continue to work and 
make progress even if not all progress made by other processors has been communicated to them. 

Historically, work on asynchronous methods for solving linear equations focused on proving conver- 
gence in the limit. How the rate of convergence compares to the rate of convergence of the synchronous 
counterparts, and how it scales when the number of processors increase, was seldom studied and is still 
not well understood. Furthermore, the applicability of these methods was limited to restricted classes of 

^2 , matrices (e.g., diagonally dominant matrices). 

We propose a shared-memory asynchronous method for general symmetric positive definite matrices. 
We rigorously analyze the convergence rate and prove that it is linear and close to that of our method's 

C*~) ' synchronous counterpart as long as not too many processors are used (relative to the size and sparsity 

of the matrix). A key component is randomization, which allows the processors to make guaranteed 
progress without introducing synchronization. Our analysis shows a convergence rate that is linear in the 
condition number of the matrix, and depends on the number of processors and the degree to which the 
matrix is sparse. 
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1 Introduction 

Asynchronous methods for solving linear equations have been researched since Chazan and Miranker pub- 
lished their pioneering paper on chaotic relaxation in 1969 [4] (see review by Frommer and Szyld [|6])- The 
underlying idea of asynchronous methods is to allow processors to continue to work even if not all progress 
made by other processors has been communicated to them, thereby eliminating synchronization points and 
their associated cost. 

While asynchronous methods were successfully applied to many numerical problems [61, interest in them 
dwindled over the years. One reason is related to convergence rate. Historically, work on asynchronous meth- 
ods for solving linear equations focused on proving convergence in the limit. How the rate of convergence 
compares to the rate of convergence of the synchronous counterparts, and how it scales when the number 
of processors increase, was seldom studied and is still not well understood. It was observed experimentally 
that asynchronous methods can sometimes be substantially slower than their synchronous counterparts [|3]. 
Furthermore, the applicability of existing asynchronous methods is limited to rather restricted classes of 
matrices (e.g., diagonally dominant matrices). 

We propose a shared-memory asynchronous method for symmetric positive definite matrices with a 
provable linear convergence rate under a mostly asynchronous computational model which assumes bounded 
delays. A key component of our algorithm is randomization, which allows the processors to make progress 
independently with only a small probability of interfering with each other. Our analysis shows a convergence 
rate that is linear in the condition number of the matrix, and depends on the number of processors and the 
degree to which the matrix is sparse. A slightly better bound is achieved if we occasionally synchronize the 
processors. In either case, as long the number of processors used is not too large (relative to the size and 
sparsity of the matrix), the convergence rate is close to that of the synchronous counterpart. Unlike previous 
asynchronous methods, the convergence rate does not depend on numerical classification of the matrix (e.g., 
diagonal dominance). In particular, our method will converge for essentially any large sparse symmetric 
positive definite matrix as long as not too many processors are used. We discuss previous work, and contrast 
them to ours, in Section |7J after we describe our results in detail. 

The remainder of the paper is organized as follows. We describe the basic setup and give essential 
background on randomized Gauss-Seidel in Section [2] In Section [3] we propose two asynchronous models 
for executing randomized Gauss-Seidel: one assumes that consistent reads have been enforced, and one does 
not. Section [4] analyzes the convergence when consistent read is enforced. Section |5]shows that convergence 
can be improved if we control the step-size. In Section [6] we analyze convergence rate when we allow 
inconsistent reads. Finally, in Section [7] we discuss previous results and how our results relate to them. 

2 Preliminaries 

2.1 Setup and Notation 

This paper concerns with solving the linear equation Ax = b where A € M nxn is a symmetric positive 
definite matrix, and b G R n . For simplicity we will assume that A has a unit diagonal. This is easily 
accomplished using re-scaling. Our results can be easily generalized to allow an arbitrary diagonal, but 
making this assumption helps keep the presentation and notation more manageable. We denote the exact 
solution to this equation by x*, i.e. x* = A _1 b. We denote the largest eigenvalue of A by A max , and the 
smallest eigenvalue by A m i n . The condition number of A, which is equal to A max /A m i n , is denoted by n. 



We are predominantly interested in the case that A is sparse and very large, and the number of non-zeros 
in each row is between C\ and C2 <C n with a small ratio between C2 and C\. This scenario frequently 
occurs in many scientific computing applications. Throughout the paper we refer to this scenario as the 
reference scenario. We state and prove more general results; we do not use the properties of the reference 
scenario in the proofs. The reference scenario is mainly useful for the interpretation of the practical implica- 
tions of the results. Note that in the reference scenario we have A max < C2 <C n, as A has a unit-diagonal 
(so off-diagonal entries must be smaller than or equal to one). 

We use (•, -)a to denote the A inner product. That is, (x, y)A = y T Ax where x, y € M" '. The fact that 
A is a symmetric positive definite matrix guarantees that (•, -)a is an inner product. The A-norm is defined 
by ||x||a = a/(x, x)a- We use e' 1 ', e^ , . . . , e( n ) to denote the ?i-dimensional identity vectors (i.e. eW is 
one at at position i and zero elsewhere). A, denotes row i of A, and A^ denotes the i, j entry of A. We will 
generally use subscript indexes on vectors for iteration counters. The notation (x)j denotes the ith entry of 
x. 

Throughout the paper we describe algorithms that generate a series of approximations to x*, denoted 
by xo,xi, . . . (subscript index is the iteration counter), which are actually random vectors. We denote the 
expected squared A-norm of the error of x m by E m . That is, 

£ m = E[||x m -x*|| A ] • 

2.2 Randomized Gauss-Seidel 

Our asynchronous algorithm is based on the randomized variant of the Gauss-Seidel iteration, originally 
proposed by Leventhal and Lewis (H (see also Griebel and Oswald Q). The goal of this section is to 
describe and review the basic properties of the randomized Gauss-Seidel iteration. 

Consider the following iteration applied to some arbitrary initial vector xo G W n , and a series of direction 
vectors do,di, . . . : 

r, = b — Axi 



v j 



15 = d J r i 



j 
x i+1 = xj + 7,-dj . 

In terms of the analysis it is more convenient to write the iteration in the following equivalent form: 

7j = (x*-x i ,d i )A 

x i+ i = x i + 7j d i . 

The reason for listing both iterations is to show that even though x* (which is unknown) appears in dTJ, the 
iteration is computable. 

In CQ) the scalars 70,71, • . . are selected so as to minimize ||x* — x j+ i||a when Xj+i is obtained from 
Xj by taking a step in the direction dj. There are quite a few ways to set do, di, ... . Each is associated 
with a different per-iteration cost, and different convergence properties. One well known method is setting 
dj = e((* mod n ) +1 ). In that case, every n iterations corresponds to a single iteration of Gauss-Seidel (recall 
that we assume that the matrix has unit diagonal). 

Leventhal and Lewis suggested using random directions instead of deterministic ones: dp, di, . . . are 



Algorithm 1 Randomized Gauss-Seidel 



Input: A £ M" xn , b £ K", (pointer to) vector x (initial approximation and algorithm output) . 

for j = 1,2, . . . do 

Pick a random r uniformly over {1, . . . , n} 

Read the entries of x corresponding to non-zero entries in A r 

Using these entries, compute 7 «— (b) r — A r x 

Update: (x) r <— (x) r + 7 
end for 

i.i.d. random vectors, taking e^ , . . . , e( n ) with equal probability^. With this distribution of direction vectors, 
they proved the following bound on the expected error in the A-norm (8] : 

£ m <(l-^p) m ||x -xli. 

So, the randomized Gauss-Seidel iteration converges in expectation at a linear- rat<§ Note that the expected 
cost per iteration of randomized Gauss-Seidel is 0(nnz (A) /ri), so n iterations are about as costly has a 
single Gauss-Seidel iteration. The proof of the last bound relies on the following lemma, which we use 
extensively in our analysis as welo 

Lemma 2.1. Let d be a random vector taking &• ' , . . . , e^ n ^ with equal probability. Suppose that x and d 
are independent. Then, 

^1EE [||x - x*||i] < E [(x - x*, d)i] < ^E [||x - x*||i] . 

3 Asynchronous Randomized Gauss-Seidel 

Algorithm [Q contains a pseudo-code description of randomized Gauss-Seidel in which we made the 
read and update operations explicit. This obviously entails some details that are, in a sense, implementa- 
tion specific. There are implementations of the randomized Gauss-Seidel iteration which do not match the 
description in Algorithm [Q 

Consider a shared memory model with P processors. Each processor follows Algorithm [T] using the 
same x, i.e. all processors read and update the same x, which is stored in shared memory. The processors 
do not explicitly coordinate or synchronize their iterations. We do, however, impose assumptions, some 
of which may require enforcement in an actual implementation. The first assumption is rather simple: the 
update operation in each iteration is atomic. 



'Leventhal and Lewis consider the more general setting where A does not have unit diagonal. For that case, they analyze non- 
uniform probabilities. When the matrix has unit diagonal, their algorithm and the convergence analysis reduces to the ones stated 
here. 

"Some care should be employed with terminology. Some mathematicians or computer scientists might say this is an exponential 
or geometric convergence rate. However, numerical analysts refer to this rate as linear, as it is linear in 0(log(e)) where e is the 
desired reduction factor of the error. 

3 The upper bound in the following lemma was not proven by Leventhal and Lewis (8), but it can be proved using the same 
technique they used to prove the lower bound. For completeness we include a proof in the appendix. 



Assumption 3.1 (Atomic Write). The update operation in line 7 is atomic. 

The update operation operates on a single coordinate in x. For single precision or double precision 
floating point reals, updates of the form used in line 7 have hardware support on many modern processors 
(e.g. compare-and-exchange on recent Intel processors). 

If atomic write is enforced, then for the sake of the analysis we can impose an order xo , xi , X2 , . . . on 
the values that x takes during the computation. Here Xj denotes the value of x after j updates have been 
applied (breaking ties in an arbitrary manner). 

We now turn our attention to the read operation in line 5. Here we consider two possible models. In the 
first model, we assume the following consistent read assumption is enforced. 

Assumption 3.2 (Consistent Read). The values of the entries of x read in line 5 appeared together in x at 
some time before the update operation (line 7) is executed. 



Note that assumption 13 . 2 1 does not imply that none of the entries read during the execution of line 5 are 
modified while that line is being executed, but the opposite does hold (and this is one way of enforcing this 
assumption). 

With consistent read we can denote by k(j) < j the maximum iteration index such x.k(j) * s e( l ua l t0 
the values read on line 5, on the indexes read during the execution of line 5. The existence of such a k(j) 
is guaranteed by assumption I3.2l (since all writes are atomic, all time intervals correspond to some iteration 
index). The iteration can then be written: 

1i= (x*-Xfc(j),dj)A 

We also consider a model where we allow inconsistent reads. Since every iteration changes a single 
coordinate, and we require all writes to be atomic, the value of x read in line 5 is the result of a subset 
of the updates that occurred before the write operation in line 7 is executed. Let us denote by K(j) C 
{0, 1, ... ,j — 1} a maximal set of updates consistent with the computation of 7 in iteration j (a formal 
definition of K(j) is as follow: an update index i is in K(j) if either it updates an entry of x not read for 
computing jj , or it updates an entry and the update was applied before that entry was read) . The vector read 
is then 

idK(j) 

The iteration can then be written as 

1i = ( x *- x ^(i)> d i)A 

Xj + l = Xj + -fjdj . 

Obviously, enforcing consistent read involves some overhead. However the bounds we obtain for the 
inconsistent read model are not as good as the ones achieved when we assume consistent reads. There is 
clearly a trade-off here, which we present but do not attempt to quantify. It is a complex trade-off that 
depends on many factors, including possible hardware features like transactional memory that may enable 
efficient enforcement of consistent reads. We do however note that in many cases even without any special 
provisions, the probability of an inconsistent read in a single iteration is extremely small, so much that we 



do not expect it to happen much (or at all) in a normal execution of the algorithm. The reason is that in order 
to have an inconsistent read in a single iteration there has to be at least two updates to entries read during 
that iteration. Consider again the reference scenario. Each iteration reads at most C2 <C n entries. Suppose 
there are u updates while reading those entries. Each such update affects a single random entry. Therefore, 
the probability that it will update one of the C2 entries being read is at most C^/n. The probability of getting 
two such updates is bounded by the probability of getting at least two in a binomial distribution with u 
experiments and probability C^/n. Unless u is very large, this is an extremely small probability (since C2/n 
is tiny). 

We are mainly interested in algorithms with provable convergence rates. In a totally asynchronous model 
with arbitrary delays, there can also be an arbitrary delay in convergence. Therefore, we assume that asyn- 
chronism is bounded in the sense that delays are bounded. 

Assumption 3.3 (Bounded Asynchronism). There is a constant r (measure of asynchronism) such that 
all updates that are older than r iterations participate in the computation of iteration j, for all iterations 

3 = 1,2,.... 

In the consistent read model, this assumption translates to requiring that 

j-r< k(j) < j . (3) 

In the inconsistent read model, this assumption translates to requiring that 

{0, 1, ... , max{0, j - r - 1}} C K(j) . (4) 

When the the variance in the number of non-zeros per row is not too large (with respect to the mean), as is 
the case in our reference scenario, then the time spent per iteration is roughly uniform. Therefore, in that 
case we expect r to be of order of P. 

We now discuss the relation between k(0), k(l), . . . or K(0),K(1), . . . and the random variables do, di, . 
If we inspect the pseudo-code of Algorithm Q] closely we will realize that k(j) or K(j) (depending on the 
model) depend on the random choices do, di, . . . , dj_i made before the write operation, and more crucially 
on the random choice dj. The reason is that on line 5 we read only the relevant entries of x, so only a small 
set of updates can be considered for inclusion, and this set of relevant entries is determined by the selection of 
dj. However, a completely adversarial model which allows dependence of k(j) (or K(j)) on do, di, . . . , dj 
(for j = 1,2,...) and analyzes the worst-case behavior is not likely to be very faithful to the actual behavior 
of the algorithm. Therefore, we assume the delays are independent of the random choices, but allow them to 
be arbitrary (as long as the bounded asynchronism assumption holds). 

Assumption 3.4 (Independent Delays). We allow an arbitrary set of delays that satisfy ((3]) or (01) (depending 
on the context), but they do not depend on the random choices do, dx, 

4 Convergence Bound With Consistent Read 

In this section we analyze iteration ©, which corresponds to the consistent read model (which assumes 
both assumptions 13.11 and 13.21) . under assumptions I3.3l and l3.4l We first state and prove the results, and then 
discuss them. 



Theorem 4.1. Consider iteration ©/or an arbitrary starting vector xo, where do, di, . . . are i.i.d. vectors 

that take e' 1 ) , . . . , e^ n ' with equal probability, and k(0), k(l), . . . are such that (0 holds but are independent 
of the random choices of do, d\, ... . Let p = max^ {^ Ylr=i l-A-Zrl}- Provided that 2pr < 1, the following 
holds: 

(a) For every m > . ' og( , 1/2) , . « %^ we Aave 

1 ' J — log(l-A max /n) A max 



(6J Le?T 



log(l/2) 
log(l-A max /n) 



and T = Tq + t. For every m > rT (r = 1, 2, . . . J we have 



where 

_ pT 2 A max (l - A max /n)~ 2r 

A 

n 
Proof. In the proof we use the following abbreviations: 



v T = 1 — 2pr 



Simple algebraic manipulations show that (see appendix for details): 

||x i+ i - x*|| A = ||xj - x*|| A - (x fc(i) - x*,dj) A - 2(x fc(i) - x*,d j ) A (x i - x fc(i) , d^A ■ (5) 

We see that the error decreases by a "progress term" ((x^y) — x*, dj) A ) which is always positive, and 
changes by an additional term (2(x fe (,) — x*, dj) A (^-k(j) ~ x j> dj) A ) which might be positive or negative. 
When the iterations are synchronized (k(j) = j) there is no additional term, and the analysis reduces to the 
analysis of synchronous randomized Gauss-Seidel. We first bound the additional term: 

i-i 
2(x fc(i) - x*, dj) A (jCj - x fc (,-), dj) A = 2(x fe(i) - x*, d j ) A ( ^ j t d t , dj) A 

t=k(j) 

3-1 

= Yl 2 ( Xfc (i) -x*,d j )A(x* -x fc(t ),d t ) A (d t ,dj)A (6) 

t=fc(j) 

i-i 

^ " E [( x fc(i)- x *' d i)Al(dt,d J -)A| + (xfc( t) -x*,d t )i|(d t ,d J -)A|] • 
t=k(j) 



Since k(j) < t < j: 

E[(x fc(3 - ) -x*,d i )i[(d t> d i )A[] = E[E[(x fe(j) -x*,d i )i|(d t ,d i )A||d ,. .,d t _i]] 

= E 



E 



< pE 



^EE(^)-^,e«)i|(e«,e«) A 

1=1 r=l 
1 n n 

^EE( x m;)- x > (0 )aIAz ; 

= pE[(x fc(j - ) -x*,d J -) A ] 



Z=l r=l 
IX> fc(j) -x*,e(0)2 



J=i 



Similarly, E [(x fe ( t ) — x*, dj) A |(d t ,dj) A |] < pE [(x fe ( t ) — x*, dt) A ], Taking expectation of © and ap- 
plying the last inequality we find that 

j'-i 

E [2(x fc(i) - x*, djjACxj - x fe(j) , dj) A ] > -p J2 i E t( x fc(i) ~ x *' d j)i] + E [( x fc« ~ x *' d *)i]] 

t=k(j) 

3-1 

= -plJ-kWlElfaw-x^dtf^-p Y, E[(x fc(t) -x*,d t )i] 

> -prE [(x fc(j) - x*, d,)i] ~pJ2 E i^Ht) ~ x *> d *)l] • 

t=k(j) 

Taking expectation of ©, and plugging in the last inequality, we find that 



i-i 



£, +1 < £,- - (1 - pr)E [(x fc(j) - x*, d,)l] + p E E K X *M " x *' d *)l] 

t=*0*) 



(7) 



Unrolling the recursion, we find that for every m 

m—l 



to— 1 i— 1 

E m <E -(l- pr) J2 E [( x fc(i) " x *' d *)l] + P E E E [( X M<) " x *' d *)l] • 

j=0 t=fc(i) 



i=0 



In the last sum of the previous inequality (p^^Lo J2l=k(i) E [( x fc(i) ~~ x *> dj) A ]), eacn term °f th e fo fm 
E [(x fc ( r ) — x*, d r ) A ] appears at most r times, each time with a coefficient p, so 

TO— 1 

£m < ^0 - (1 - 2pr) J2 E [( x fc« " x *> d *)l] ■ 

i=0 

We now apply the bound E [(xj^) — x*, dj) A ] > (Amin/n)^^) (Lemma l2~Tb . to find that 



to— 1 
^m < -E-0 — ^min /_, -^fe(i) • 

4 = 



(8) 



Proof of (a). Lemma |2T1 implies that for any b > a we have E\, > (^"^(see proof in Appendix), so in 
particular since i > k(i), 

E k (i) > S^E > 5 l max E . (9) 

Plugging © into ® we get the following inequality, which leads immediately to assertion (a): 

E,n <(l- £mi„ E ^max J E Q = (l ~ ^jjf^ ) ^) = (1 " ^(l " Cax))^0 ■ 

Proof of (b).Let 

Cj = {rT + i-T<t< rT + i-l : t > rT} 

and 

Di = {rT + i-T<t< rT + i-l : t < rT} . 

Unrolling the recursion in equation (£7) starting at rT, we find that for r > 1 and w > 

T-l+w T-l+w rT+i-l 

E (r+1)T+W < E rT -(l-p T ) Y E[(x fc(rT+i) -x*,d rT+i )i] + p Y J2 E [( x fc(t)- X ^' d i)i] 

t=0 i=0 i=fc(rT+i) 

T-l+w T-l+w rT+i-l 

< ^ rT _(i_ pr ) J2 E[(x MrT+i) -x*,d rT+i )i]+p ^ J^ E[(x fc(t) -x*,d t )i] 

t=0 i=0 t=rT+i-T 

T-l+w T-l+w 

< E rT -{l-pr) Y, E[(x fe(rT+i) -x*,d rT+i )i]+p ^ E E [( X ^)- X *> d *)i] 

t=o i=o teQ 

T-l+w 

+p Y X>[(x fe(t )-x*,d t )i] 

1=0 iGDi 

T-l+w r-1 

< SrT _ (i _ 2y0T ) ]T E [(x fc(rT+i) - x*, d rT+i )i] + p^ J^ E [(x fc(i) - x*, d t )i] • 

i=0 «=0 ieD, 

T-l+w r-1 

< E rT -(l-2pr) Y E [(x i(rT+i) -x*,(W +i )i] +/) J J E [(x i(() -x*,d t )i] . (10) 

i=r i=0 teDj 

The second-to-last inequality follows from the fact that each term of the form E [(x fc m — x*, dj)^l appears 

at most r times in p Yli=o +W YlteC ^ [( x fc(t) ~~ x *i dt) a] . We also use the fact that for i > r we trivially 

have Di = 0. 

We first bound E rT - (1 - 2pr) Ya=t +W e [(*k(rT+i) - x*, d r r+j)i] . Using Lemma|2j] 

T-l+w T-l+w 

E r T - (1 - 2pr) 2^ E [(x fc ( rT+i ) - x*, d rT+i ) A ] < E rT - 5 min Y, E k ^. T+() . 

i=T i=r 

]c(tT+%) vT 

Since i > r we have /c(/cT + i) > kT so E k ^ rT+i ^ > 5 max -E^T > ^max-S'rT- Therefore 

T-l+w T-1+w-t 

£ rT " (1 " 2f)T) Y E [( X fc(rT+ 4 ) " X*, d rT+i )i] < (1 - <5 min <^ ax Y 5 max)£rT ■ 
i=r i=0 



Noticing that T — \-\-w — t = Tq — \ + w and bounding the geometric sum as in assertion (a), we find that 

(*■ °min«max Z^j=0 "max/ — \ L 2k /' bU 

T-l+w „ 

E rT - (1 - 2pr) J^ E [(x fc(rT+i) - x*, d rT+i )l] < (1 - - J2 ^ L )ErT • (11) 

We now bound pY7iZo YlteD- ^ [( x fc(t) ~~ x *) d t )^J . Recall that for every 6 > a we have E^ > (fex-Eaj 
so, for i = 0, . . . , r — 1 and i G D, we have 

#fc(t) < <5 m ( ^T r £ r T < <5 max £ rT . 

The last inequality follows from the fact that for t £ Di, k(t) — rT > —2t and <5 max < 1- We now bound 

pf. E E [(**<*) - x ^' d *)i] <pEE ^^^t < ^^7 5 ™ E rT = xErT . 

i=0 teDt i=0 t£Di 

Combine the last inequality with (fTTb and assertion (a) to complete the proof of assertion (b). □ 

Discussion: 

• Assertion (a) shows that after we perform enough asynchronous iterations, we are guaranteed to reduce 
the expected error by a constant factor. In order to drive the expected error down to an arbitrary fraction 
of the input error, we can adopt the following scheme. We start with asynchronous iterations. After 
n iterations have been completed we synchronize the threads and restart the iterations. The matrix A 
has unit diagonal, so A max > 1. Therefore, by performing k > n iterations, we are guaranteeing a 
1 — v t /2k factor reduction in the expected error. We then continue to iterate and synchronize until 
the expected error is guaranteed to be small enough. The number of outer iterations until convergence 
(reduction of error by a predetermined factor) is 0{k/v t ). This is also the number of synchronization 
points. When v T is close to one, the number of synchronization points is asymptotically the same as in 
Jacobi, but the convergence rate is that of Gauss-Seidel. Furthermore, we do not need to really divide 
the iterations between processors (basically, every processor can do as many iterations as it can, until 
synchronization) and it is not important to synchronize exactly after n iterations. So, from a practical 
perspective, a time based scheme for synchronizing the processors should be sufficient, and will not 
suffer from large wait times due to load imbalance. 

• Assertion (b) shows that even if we do not occasionally synchronize the threads, we still get long-term 
linear convergence, but at a slower rate. We say convergence is linear in the long term since we cannot 
guarantee a diminishing bound in every iteration, but we can prove a constant factor reduction over a 
large enough amount of iterations. 

• The terms <5 max and 5~5j£ that appear in assertion (b) might seem problematic as they are exponential 
in the number of processors (because r = £l(P)). However, in our reference scenario this is not an 
issue since in that scenario A max = O(l) and P and r are much smaller than n, so <5 max and <5~ ax are 
actually very close to 1. 



• The number of iterations to guarantee a 1 — v t )2k reduction of expected error (as in assertion (a)) in 
synchronous randomized Gauss-Seidel is approximately u T n/2X max . 

• Consider our reference scenario in a weak-scaling regime (i.e., P « en for a very small c). In that 
case u T is constant and close to one, so if we occasionally synchronize the threads we have a constant 
percent increase in the number of iterations due to asynchronism. That is, the asynchronous phases do 
not violate the weak-scaling (but the number of iterations can increase due to A m i n becoming smaller). 
As for the case where only asynchronous iterations are used, we have \ ~ c2 ^max- So, x itself exhibits 
weak scaling. However, its value should be interpreted with respect to k^ 1 . If A m ; n shrinks as n grows, 
as is the case in many applications, then the relative size of x grows and we do not have weak scaling. 

5 Improving Scalability By Controlling Step-Size 

The bound in Theorem 14. 1 I requires 2pr < 1. In this section we show that by introducing a step-size we can 
have a convergent method for any delay (as long as we set the step size small enough). By optimizing the 
step-size we can also improve the scaling (dependence on r) in our bounds. 

The idea in introducing a step-size is that instead of taking a full step, we take a partial step by multiplying 
it by a step-size /5. That is, we now consider the iteration 

X i+ i = Xj + pnfjdj 

(jj is defined as before). Again, simple algebraic manipulations show that (see appendix for details): 

l|xj'+l- x *lli = ||x i -x*||^-/3(2-/3)(x fc(i) -x*,d j )^-2/3(x fc(i) -x*,d i ) A (x j -x fc(i) ,d i )A. (12) 

As before, we continue with bounding the additional term. 

j'-i 

2/3(x fc(i) - x*, dj) A (xj - x fc(i) , d,) A = 2/3(x fc(j) - x*, dj) A ( ^ Pjtdt, dj) A (13) 

t=k(j) 

3-1 

= 1 ^2 2 ( x fc(i) -x*,d.j) A (x* -x fc ( t ),d t ) A (d t ,dj) A 
o-i t=k(j) 

>-/3 2 £ [(x fc0 - ) -x*,d i )i[(d t ,d i )A[+(x fc ( t )-x*,d t )i|(d t ,d i ) A |] • 
t=k(j) 

We see that the progress term is 0{j3), but the additional term is 0(/3 2 ). In synchronous randomized 
Gauss-Seidel the best bound on the expected error is achieved with (3 = 1, but for an asynchronous compu- 
tation the best bound is achieved with some /3 < 1 (depending on t). Continuing along the lines of the proof 
of Theorem 14. l| (we omit the details), we find the following modified bounds: 

(a) :Em <(l- " A ^ - ^ . i, wi ^(/3) W1 M/3)^ ax . ..^r-l 

where 



(a) : E m < 1 - -^ E , (b) : E m < (1 - -^)(1 - ^J^i + x (f3)y^E 
\ 2k I 2k 2k 



v t ((3) = 2/3-/3 2 -2 P t/3 2 x (J3) = E1^ 



n 



Discussion: 
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• We see that for a sufficiently small /3 both bounds are useful, but the computation of the optimal /3 for 
assertion (b) (in terms of the bound) requires some approximation of the condition number. 

• Alternatively, we can optimize only the value of u r {f3). The optimum of that term is achieved at 
J3 = 1/(1 + 2pr) and yields u r 0) = 1/(1 + 2pr). It is also the case that x0) < x(l)> s ° both 
bounds are improved. From a practical perspective, the challenge of setting the step size to f3 is that 
r might not be known. However, under normal circumstances (and in the reference scenario) we have 
r = O(P), which can provide a general guideline for setting the step-size. 

6 Convergence Bound With Inconsistent Reads 

We now analyze a model without consistent read (without assumption I3.2I ). To show convergence, we must 
use a step size, so we analyze the following iteration: 

7j= (x*-XK(j),dj)A 
Xj+i = Xj + frjdj . 

The rate of convergence is slower, and the scalability is worse as the dependence on r is worse. 

Theorem 6.1. Consider iteration (1141 ) for some < /3 < 1 and an arbitrary starting vector xq, where 
do,di, . . .arei.i.d. vectors that take e^ 1 ' , . . . ,e^ n ' with equal probability, and K(0), K(l), . . . are such that 
equation © holds but are independent of the random choices of do, di, . . . . Let pi = max/ j- Ylr=i ^lr}- 
Provided that 2/3(1 — j3 — p 2 T 2 j3/2) < 1, the following holds: 

(a) For every m > , ' og( , 1/2) ■ , w %^3« we have 

V J — log(l — Amax/n) Amax 



(b) LetT ( 



o 



log(l/2) 



log(l-A max /n) 



and T = Tq + t. For every m > rT (r = 1,2, .. . ) we have 



Em < (1 - 2/3(1 -/3-p 2 r 2 /3/2) )(i _ 2/3(1 -/3-p 2 r 2 /3/2)(l-A max /nr + ^^ 



where 

P2"7- 3 /3 2 A max (l - \ max /n)- 2r 

x = 



■n 



Most of the proof is analogous to the proof of Theorem 14. II so we give only a sketch of the proof that focuses 
on the unique parts. 
Proof. (Sketch) As before: 



Xj+i - x*|| A = ||xj - x*|| A - /3(2 - P)(x K (j) ~ x *, d j)A - W{*K(j) - x*, dj) A (xj - x KU) ,dj) A . 
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We now bound the additional term: 



W{*K(j) -x*,d j ) A (x j -^ K{j) ,dj) A = 2/3(x^ (j) -x*,d,-) A ( ^ ^7td t ,d.,-) A 

= 2/? 2 (x i ^ (j) -x*,d j ) A ( ^ 7tdt,d i ) A 

teK-(i) 



> -/3 2 



( Xft:(i) - x*, dj)i + ( J] 7td *' d ^ 
-/3 2 [(x / , (i) -x*,d J ) A + 

|if-(j)| J^ (x Ji : (t )-x* ) d t )i(d t ,d i )i 



(15) 



> -/3 2 



( x ^(i) - x *' d i)i + r X] ( x *w ~ x *' d i)i( d i, d i)i 

tetf-tf) 



where K (j) = {0, . . . , j — 1} — if (j). Since 'X-Ktt) does not depend on d t or dj, we can bound as before, 

E [(x K{t) -x*,d t )i(d t) d i )i] < p 2 E [( XAr(t) -x*,d t )i] . 
Therefore, 

E j+1 < Ej - 2/3(1 - 0)E [{x m - x*, dj)i] + P2T/3 2 ]T E [(x* (t) - x*, d t ) A ] . 

After we unroll the recursion, we find that 

fe-l 
£ fc < E - 2/3(1 - /3 - p 2 r 2 /3/2) ]T E [(x K(i) - x*, d;) A ] . 



i=0 



We can now continue to bound as in the proof of Theorem 14. 1 1 The crucial observation is that x^m is the 



rl*WI 



result of \K(i)\ random single coordinate steps, so E [||x^) — x*|| A J > 6' ms ^ 'Eq > 5 l mSiX Eo. 



D 



Remark 6.2. One might ask why we did not simply adapt equation (fT3l ) for the inconsistent read iteration, and 
instead developed equation (fT5b . We could certainly adapt equation (fT"3T ). The problem is that if we follow 
that path, we get expressions of the form (xk(j) ~ x *> d j) A l( d *> d j)A| for t € K~(J). This expression is 
hard to analyze since X-k(j) can depend of d t . An example is K(j) = {0, . . . , j — 3, j — 1} and t = j — 2 
(for some j > 3). 

7 Related Work and Concluding Remarks 

Asynchronous methods were first suggested by Chazan and Miranker H in their pioneering paper on chaotic 
relaxation. The theory and application of asynchronous iterations has since been studied and used by many 
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authors. Noteworthy is the seminal text by Bertsekas and Tsitsiklis J2). A more recent review is by Frommer 
and Szyld (6]|. 

Historically, work on asynchronous methods focused on proving that the methods converge in the limit, 
and not on convergence rate analysis. In particular, the relation to the convergence rate of synchronous 
counterparts, and the scaling of these methods, were seldom studied. We are aware of only two exceptions, 
but the results are quite unsatisfactory. Baudet [1] proves a theorem that can be used to analyze the rate of 
convergence of asynchronous iterations of contracting operators, so they are applicable to the solution of only 
certain types of linear systems. In addition, his theorem can be used to analyze the convergence of a specific 
instantiation, and not the worst case behavior of an asynchronous method. Bertsekas and Tsitsiklis (2J 
Section 7.2, Exercise 1.2] prove a linear convergence rate of certain asynchronous iterations for some classes 
of matrices (like weakly diagonally dominant matrices), but analyze how the rate of convergence depends 
on the measure of asynchronism only under very restrictive conditions and in a hard to interpret manner E] 
Section 6.3.5]. 

Randomization is frequently used as an algorithmic tool in non-numerical asynchronous algorithms. 
However, until recently, it was not used as an algorithmic tool for asynchronous numerical computation (al- 
though, there is work on using asynchronous computation for inherently randomized methods like stochastic 
gradient descent flU). Recently, two new articles suggested asynchronous algorithms whose provable con- 
vergence rates rely on randomization. Freris and Zouzias [5] suggested the use of an asynchronous variant 
of randomized Kaczmarz ifTOTl to synchronize clocks in a wireless network. They analyze the convergence 
rate in a semi-asynchronous model that is suitable for wireless networks, but not for shared-memory numer- 
ical computations. Niu et al. [9] recently proposed and analyzed "Hogwild!", an asynchronous approach for 
parallelizing stochastic gradient descent. 

Our work is very much inspired by "Hogwild!", but we study a different problem and our proof technique 
is different. In fact, requiring atomic writes and consistent reads and consequently writing the iteration as 
equation © is a direct descendant of Niu et al.'s [9] analysis. However, our analysis makes several significant 
improvement over simply applying the "Hogwild!" analysis to the solution of linear systems (linear systems 
can be solved using stochastic gradient descent): 

1 . We treat parallelism in a much more principled manner; we explicitly lay out the assumptions made 
by the algorithm, and study different read models. In particular, we analyze both a consistent and an 
inconsistent read model, while Niu et al. analyze only a consistent read model. 

2. We prove a linear convergence rate, vs. a sub-linear convergence rate for "Hogwild!". 

3. The analysis of "Hogwild!" assumes bounded stochastic gradients. This assumption is not realistic 
when stochastic gradient descent is used to solve symmetric positive definite systems. 

4. Our dependence on r is much better: "Hogwildfs analysis has a 0(t a ) dependence. 

There are several questions that remain unanswered and are worthy of future work. Is that gap in the bound 
for consistent and inconsistent reads inherent, or an improved analysis will remove or narrow it? Is it possible 
to obtain comparable bounds when we allow k(j) or K{j) to depend on do, . . . , dj? In our reference sce- 
nario, we show weak-scaling only if we periodically synchronize the threads. Is the periodic synchronization 
essential, or is it an artifact of the analysis? 
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8 Appendix 

8.1 Proof of Lemma HZD 

Let B be the unique symmetric positive matrix such that A = B J . 

E[(x-x*,d)i] = E[E[(x-x*,d)i]|x] 



E 

1 

n 
1 

n 

i E 

n 

Ie 

n 



1 n 
±J>-x*, 



e i)i 



i=l 



E[||A(x-x*)|ffl 

E[(x-x*) T A 2 (x-x*)] 

(x-xfBAB(x-x* 

(x-xfBB(x-x*) 

(x-xfBAB(x-x* 



(x-x*) 1 BB(x-x*) 



(x-xfBB(x-x 
According to the Courant-Fischer theorem, for every vector y /Owe have 

y T Ay 



x _x Ha 



^min _ 



y T y 



<A r 



Applying the last inequality to the previous equality with y = B(x — x*) completes the proof. 

8.2 Proof of Equations © and CLU) 

We prove equation (fT2l ). Equation © follows by setting /3 = 1. 



.*||2 



l x j+i x Ha — 



|xj +/?7j d i - x *IIa 



x i - x *IIa + ll^idjUi + 2(x,- - x*,/3 7i d j ) A 

Xj - x*||i + /3 2 7 | + 2/?7i( x i - x *, dj)A 

x i " x *IIa + /3 2 ( x fc(i) " x *> d i)l " 2/5(x fe(j) - x*, d i ) A (x j - x*, dj)A 

x i - x *IIa + /3 2 ( x fc(j) - x *> d i)l 

-2/?(x fc(i) - x*, dj) A [(xj - x fe(j) , dj)A + ( x fc(i) - x *, dj)A] 

/3(2 - /3)(x fc(i) - x*, dj) A - 2/3(x fc(j) - x*, d i ) A (x j - x fc(i) , dj) A 



ll-ir- y*II 2 

— Il x i ~~ x Ha 



In the above we use the fact that A has unit diagonal, so (dj, dj) A = 1 for all i. 



8.3 Proof that E h > 6°ZtE n for b > a 



max o, 



We prove that Ej + \ > 5 max Ej for the iteration involving the step size ("X-j+i = Xj + Pjjdj, Section [5]), and 
that automatically implies the inequality for b > a and for (3 = 1 (which is the case considered in Section|4|. 
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First notice that 

x j+ i = xj + pijdj 

= Xj + j3(x* - x fc(i) , dj) A dj 

= Xj + (/3x* - Px k (j), dj)Adj 

= xj + (x* -y, dj) A dj 

where y = (1 — /3)x* + x fc y). Denote tj = (x* — y, dj) A . Now, 

ll x j+i - x 1a = ll x j + 7jdj - x*|| A 

= ll x i - x *IIa + Il7j d illi + 2 ( x i - x *)7j-dj)A 
= ||xj - x*|| A + 7? + 27j(xj - x*, dj) A 

= ||xj -x*|| A + (y - x*,dj) A - 2(y - x*,dj) A (xj - x*,dj) A 

= ||xj - x*|| A + (y - Xj + Xj - x*, dj) A - 2(y - x*, dj) A (xj - x*, dj) A 

= ll x j - x 1Ia + ( x j - x *, d i)l + (y - x j, dj)i 
+2(xj - x*, dj) A (y - Xj, dj) A 
-2(y - x*, dj) A ( X j - x*, dj) A 

= ll x j - x 1a + ( x j - x *, d i)l + (y - x j, d j)l 
-2(xj - x*, dj) A (xj - x*, dj) A 

= ll x i - x *lli - ( x j - x *> d i)l + (y - x j, d i)i 
> ll x ;- x li-( x i- x *, d ;)A- 

Taking expectation and applying Lemma |2~T1 (notice that y is independent of dj), we find that E [||xj + i — 
(l-A max /n)E[|| X j-x*|| A ]. 

8.4 Proof of Equation (O 

Let 

d = {rT + i-r<t<rT + i-l : t > rT} 

and 

Di = {rT + i-r<t<rT + i-l : t< rT} . 

Unrolling the recursion in equation (|7) starting at rT, we find that for r > 1 and w > 



x*ll 2 1 > 

x HaJ - 
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T-l+w T-l+w rT+i-l 

E(r+i)T+w < E rT -(l-pr) Y E[(x fc ( rT+i ) - x*, d r T+i)i] +P Y Y E [( Xfc W ~ x *> d *)i] 

i=0 i=0 t=k(rT+i) 

T-l+w T~l+w rT+i-l 

< E rT -(l-pr) Y E[(x fc(rT+i) -x*,d rT+i )i] +p J2 Y E[(x fc(f) -x*,d i )i] 

i=0 j=0 t=rT+i-r 

T-l+w T-l+w 

< E rT -(l-pr) Y E[(x fc(rT+i) -x*,d rT+i )i]+p ]T £)E[(x fc(t) -x*,d t )i] 

«=0 i=0 iGC*i 

T-l+w 

T-l+TO T-l 

< £ rT _(i-2pr) J^ E[(x fc(rT+i) -x*,d rT+i )i]+p^^E[(x fc(t) -x^,d t )i] . 

?=o i=o teA 

T-l+W T-l 

< £ rT _ (i _ 2 pr) Y E [( x fc(rT+i) - x*,d rT+i )i] + pY Y E [( x ^*) - x *> d *)i] • 

i=T i=0 teA 

The second-to-last inequality follows from the fact that each term of the form E [(x^m — x*, dj) A ] appears 

at most t times in p Yli=o +W YlteC- ^ [( x fe(t) ~~ x *; d *) a] ■ We a ^ so use tne ^ act tnat for i > r we trivially 
have Di = 0. 
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